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Abstract 

We consider the least-square linear regression problem with regularization by the 
£ 1 -norm, a problem usually referred to as the Lasso. In this paper, we first present 
a detailed asymptotic analysis of model consistency of the Lasso in low-dimensional 
settings. For various decays of the regularization parameter, we compute asymptotic 
equivalents of the probability of correct model selection. For a specific rate decay, we 
show that the Lasso selects all the variables that should enter the model with probability 
tending to one exponentially fast, while it selects all other variables with strictly positive 
probability. We show that this property implies that if we run the Lasso for several 
bootstrapped replications of a given sample, then intersecting the supports of the Lasso 
bootstrap estimates leads to consistent model selection. This novel variable selection 
procedure, referred to as the Bolasso, is extended to high-dimensional settings by a 
provably consistent two-step procedure. 

1 Introduction 

Regularization by the ^-norm has attracted a lot of interest in recent years in statistics, 
machine learning and signal processing. In the context of least-square linear regression, the 
problem is usually referred to as the Lasso [36] or basis pursuit [1 I]. Much of the early 
effort has been dedicated to algorithms to solve the optimization problem efficiently, either 
through first-order methods [20, 19], or through homotopy methods that leads to the entire 
regularization path (i.e., the set of solutions for all values of the regularization parameters) 
at the cost of a single matrix inversion [29, 35, 16]. 

A well-known property of the regularization by the £ 1 -norm is the sparsity of the solutions, 
i.e., it leads to loading vectors with many zeros, and thus performs model selection on top of 
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regularization. Recent works [44, 40, 45, 37] have looked precisely at the model consistency of 
the Lasso, i.e., if we know that the data were generated from a sparse loading vector, does the 
Lasso actually recover the sparsity pattern when the number of observations grows? In the 
case of a fixed number of covariates (i.e., low-dimensional settings), the Lasso does recover 
the sparsity pattern if and only if a certain simple condition on the generating covariance 
matrices is satisfied [40]. In particular, in low correlation settings, the Lasso is indeed con- 
sistent. However, in presence of strong correlations between relevant variables and irrelevant 
variables, the Lasso cannot be model-consistent, shedding light on potential problems of such 
procedures for variable selection. Various extensions of the Lasso have been designed to fix 
its inconsistency, based on thresholding [34], data- dependent weights [45, 40, 26] or two-step 
procedures [31]. The main contribution of this paper is to propose and analyze an alternative 
approach based on resampling. Note that recent work [33] has also looked at resampling 
methods for the Lasso, but focuses on resampling the weights of the £ 1 -norm rather than 
resampling the observations (see Section 3 for more details). 

In this paper, we first derive a detailed asymptotic analysis of sparsity pattern selection 
of the Lasso estimation procedure, that extends previous analysis [44, 40, 45] by focusing on 
a specific decay of the regularization parameter. Namely, in low- dimensional settings where 
the number of variables p is much smaller than the number of observations n, we show that 
when the decay of n is proportional to n~ 1//2 , then the Lasso will select all the variables that 
should enter the model (the relevant variables) with probability tending to one exponentially 
fast with n, while it selects all other variables (the irrelevant variables) with strictly positive 
probability. If several datasets generated from the same distribution were available, then 
the latter property would suggest to consider the intersection of the supports of the Lasso 
estimates for each dataset: all relevant variables would always be selected for all datasets, 
while irrelevant variables would enter the models randomly, and intersecting the supports 
from sufficiently many different datasets would simply eliminate them. However, in practice, 
only one dataset is given; but resampling methods such as the bootstrap are exactly dedicated 
to mimic the availability of several datasets by resampling from the same unique dataset [17]. 
In this paper, we show that when using the bootstrap and intersecting the supports, we 
actually get a consistent model estimate, without the consistency condition required by the 
regular Lasso. We refer to this new procedure as the Bolasso (bootstrap-enhanced least 
absolute shrinkage operator). Finally, our Bolasso framework could be seen as a voting 
scheme applied to the supports of the bootstrap Lasso estimates; however, our procedure 
may rather be considered as a consensus combination scheme, as we keep the (largest) subset 
of variables on which all regressors agree in terms of variable selection, which is in our case 
provably consistent and also allows to get rid of a potential additional hyperparameter. 

We consider the two usual ways of using the bootstrap in regression settings, namely boot- 
strapping pairs and bootstrapping residuals [17, 18]. In Section 3, we show that the two types 
of bootstrap lead to consistent model selection in low-dimensional settings. Moreover, in Sec- 
tion 5, we provide empirical evidence that in high-dimensional settings, bootstrapping pairs 
does not lead to consistent estimation, while bootstrapping residuals still does. While we are 
currently unable to prove the consistency of bootstrapping residuals in high-dimensional set- 
tings, we prove in Section 4 the model consistency of a related two-step procedure: the Lasso 
is run once on the original data, with a larger regularization parameter, and then bootstrap 
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replications (pairs or residuals) are run within the support of the first Lasso estimation. We 
show in Section 4 that this procedure is consistent. In order to do so, we consider new sufficient 
conditions for the consistency of the Lasso, which do not rely on sparse eigenvalues [34, 41], 
low correlations [12, 27] or finer conditions [6, 15, 42]. In particular, our new assumptions 
allow to prove that the Lasso will select not only a few variables when the regularization 
parameter is properly chosen, but always the same variables with high probability. 

In Section 5.1, we derive efficient algorithms for the bootstrapped versions of the Lasso. 
When bootstrapping pairs, we simply run an efficient homotopy algorithm, such as Lars [16], 
multiple times; however, when bootstrapping residuals, more efficient ways may be designed 
to obtain a running time complexity which is less than running Lars multiple times. Finally, in 
Section 5.2 and Section 5.3, we illustrate our results on synthetic examples, in low-dimensional 
and high-dimensional settings. This work is a follow-up to earlier work [1]: in particular, it 
refines and extends the analysis to high- dimensional settings and boostrapping of the residuals. 

Notations For x £ MP and q > 0, we denote by \\x\\ q its ^-norm, defined as \\x\\^ = 
Yli=i \ x i\ q - We also denote by ||o;||oo = max !g { lr .. iP } \x%\ its £°°-norm. For rectangular matrices 
A, we denote by || A\\ 2 its largest singular value, ||^4||oo the largest magnitude of all its elements, 
and ||^4||f = (tiA T A) 1 / 2 its Frobenius norm. We let denote A max (Q) and A min ((5) the largest 
and smallest eigenvalue of a symmetric matrix Q. 

For a6R, sign(a) denotes the sign of a, defined as sign(a) = 1 if a > 0, —1 if a < 0, and 
if a = 0. For a vector v £ MP, sign(w) £ { — 1, 0, 1} P denotes the vector of signs of elements 
of v. Given a set H, 1# is the indicator function of the set H. We also denote, for w £ MP, 
by m(w) = min J - e { lv jP } w .^ \ wj\, the smallest (in magnitude) of non-zero elements of w. 

Moreover, given a vector v £ 1R P and a subset / of {l,...,p}, vj denotes the vector 
in R Card ( / ) of elements of v indexed by /. Similarly, for a matrix A £ M pxp , Aj t j denotes 
the submatrix of A composed of elements of A whose rows are in / and columns are in J. 
Moreover, \J\ denotes the cardinal of the set J. For a positive definite matrix Q of size p, 
and two disjoint subsets of indices A and B included in {1, . . . ,p}, we denote Qa,a\b the 
matrix Qa,a — Qa,bQ^bbQb,Ai which is the conditional covariance of variables indexed by A 
given variables indexed by B, for a Gaussian vector with covariance matrix Q. Finally, we let 
denote P and E general probability measures and expectations. 

Least-square regression with £ 1 -norm penalization Throughout this paper, we con- 
sider n pairs of observations (xi,yi) £ MP xl, i = 1, . . . , n. The data are given in the form 
of a vector y £ M n and a design matrix X £ IR nxp . We consider the normalized square loss 
function 

1 n 1 
— / (% — w 1 Xi) 2 = — \\y — Xw\\1, 
2n^ yy ' 2n ny ll2 ' 

i=l 

and the regularization by the £ 1 -norm. That is, we look at the following convex optimization 
problem [36, 14]: 

m ™7^\\y - Xw\\ 2 2 + fi\\w\\!, (l.i) 
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where /i ^ is the regularization parameter. We denote by w any global minimum of Eq. (1.1), 
and J — {j G {1, . . . ,p}, Wj ^ 0} the support of w. 

In this paper, we consider two settings, depending on the value of the ratio oip/n. When 
this ratio is much smaller than one, as in Section 2, we refer to this setting as low-dimensional 
estimation, while in other cases, where this ratio is potentially much larger than one, we refer 
to this setting as a high-dimensional problem (see Section 4). 

2 Low-Dimensional Asymptotic Analysis 

We make the following "fixed-design" assumptions: 

(Al) Linear model with i.i.d. additive noise: y = Xw+e, where e is a vector with independent 
components, identical distributions and zero mean; w is sparse, with s = sign(w) and 
support J = {j, Wj 7^ 0}. 

(A2) Subgaussian noise: there exists r > such that for all j G {l,...,p} and s G K, 
Ee S£j ^ ea r s . Moreover, the variances of E j £1X6 greater than a 2 > 0. 

(A3) Bounded design: For all % G {1, . . . ,n}, ||£i||oo ^ M. 

(A4) Full rank design: The matrix Q = ^X T X G R pxp is invertible. 

Throughout this paper, we consider normalized constants w = wM/a (normalized pop- 
ulation loading vector), jl = ji/Ma (normalized regularization parameter), A = A min (Q)/M 2 
(condition number of the matrix of second-order moments), and f — r/a (always larger than 
one, and equal to one if and only if the noise is Gaussian, see Appendix A. 2). 

With our assumptions, the problem in Eq. (1.1) is equivalent to 

min -(w — w) T Q(w — w) — q T (w — w) + (2.1) 

where Q = ^X T X G R pxp and q = ^X T e G W. Note that under assumption (A4), there is 
a unique solution to Eq. (1.1) and Eq. (2.1), because the associated objective functions are 
then strongly convex. Moreover, assumption (A4) implies that p ^ n, that is, we consider in 
this section, only "low-dimensional" settings (see Section 4 for extensions to high-dimensional 
settings). 

In this section, we detail the asymptotic behavior of the (unique) Lasso estimate w, both 
in terms of the difference in norm with the population value w (i.e., regular consistency) and 
of the sign pattern sign(w), for all asymptotic behaviors of the regularization parameter /i. 
Note that information about the sign pattern includes information about the support J, i.e., 
the indices j G {1, . . . ,p} for which Wj is different from zero; moreover, when w is consistent, 
consistency of the sign pattern is in fact equivalent to the consistency of the support. We 
assume that p is fixed and n tends to infinity, the regularization parameter /i being considered 
as a function of n (though we still derive non-asymptotic bounds). 
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Note that for some of our results to be non trivial, we require that p is not only small 
compared to n, but that a power of p is small compared to n. Technically, this is due to 
the application of multivariate Berry-Esseen inequalities (reviewed in Appendix A.l), which 
could probably be improved to obtain smaller powers. 

We consider five mutually exclusive possible situations which explain various portions of 
the regularization path; many of these results appear elsewhere [40, 44, 21, 45, 2, 27] but some 
of the finer results presented below are new (in particular most non-asymptotic results and 
the n~ 1//2 -decay of the regularization parameter in Section 2.4). These results are illustrated 
on synthetic examples in Section 5.2. 

Note that all exponential convergences have a rate that depends on m(w), i.e., the smallest 
(in magnitude) non zero element of the generating sparse vector w. Thus, we assume a sharp 
threshold in order to have a fast rate of convergence. Considering situations without such 
a threshold, which would notably require to estimate errors in model estimation (and not 
simply exactly correct or incorrect), is out of the scope of this paper (see, e.g., [41]). 

2.1 Heavy regularization 

If /i is large enough, then w is equal to zero with probability tending to one exponentially 
fast in n. Indeed, we have (see proof in Appendix D.l): 



Proposition 2.1. Assume (Al-4). If jl^ 2||w|| 1; then the probability that w = is greater 
than 1 - 2pexp f-^CV 



A well-known property of homotopy algorithms for the Lasso (see, e.g., [16]) is that if fi 
is large enough, then w = 0. This proposition simply provides a uniform probabilistic bound. 

2.2 Fixed regularization 

If /i tends to a finite strictly positive constant /io, then w converges in probability to the 
unique global minimum of the noiseless objective function |(u> — w) T Q(w — w) + /x |l w lli- 
Thus, the estimate w never converges in probability to w, while the sign pattern tends to 
the one of the previous global minimum, which may or may not be the same as the one of 
the noiseless problem w. It is thus possible, though not desirable, to have sign consistency 
without regular consistency. See [2] for examples and simulations of a similar behavior for 
the group Lasso. 

All convergences are exponentially fast in n (proof in Appendix D.2). Note that here 
and in the next regime (Proposition 2.3), we do not take into account the pathological cases 
where the sign pattern of the limit in unstable, i.e., the limit is exactly at a hinge point of the 
regularization path. When this occurs, all associated sign patterns are attained with positive 
probability (see also Section 4). 

Proposition 2.2. Assume (Al-4:). Let fi > and ftQ = [Iq/ M / sigma. Let Wq be the unique 
solution o/min^gKp |(t> — w) T Q(v — w) + /i |H|i. Then, if \fi — £l \ ^ jrfiP, we have: 
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Moreover, assume the minimum v occurs away from a hinge point of the regularization path, 
i.e., there exists rj > such that for all j G {l,...,p}, Vj = implies \(Q(wq — w))j| ^ 
fio — r]Ma. If\jl — jlo\ ^ A min{?7/4, m(woM/a)}, then 

( A 2 n\ 
P(sign(w) ^ sign(u7 )) ^ 2pexp I — — mm{r] 2 /A, m(woM/a) 2 } — J . 

The proposition above makes no claim in the situation where \i tends to zero. As we now 
show, this depends on the rate of decay of /x, slower, faster, or exactly at the rate n~ x l 2 . 



2.3 High regularization 

If n tends to zero slower than n~ 1//2 , then w converges in probability to w (regular consistency) 
and the sign pattern converges to the sign pattern of the global minimum of a local noiseless 
objective function ^A T QA + Ajsign(wj) + HA^Hx, the convergence being exponential in /i 2 n 
(see proof in Appendix D.3). The local noiseless problem in Eq. (2.2) is simply obtained by 
a first-order expansion of the Lasso objective function around w [21, 40]. 

Proposition 2.3. Assume (Al-4). Let A be the unique solution of 

min ^A T QA + Ajsign(wj) + ||A JO ||i. (2.2) 

Assume that [i ^ ™^ 2 A . We have: 



w-w- iiA\\ 2 ^ (3o-/M) ^ 2pexp 



X 2 (3 2 n 
8f 2 p 



Moreover, assume the minimum A of Eq. (2.2) occurs away from a hinge point of the regular- 
ization path, i.e., there exists rj > such that for all j G 3 C , Aj = implies \(QA)j\ ^ 1 — rj. 
Then, 



m(w)A 2 n , 

^sign(w) fi sign(w + /iA)) ^ 2pexp [ — - — — ) + 2pexp ( —A\i 



5T p J \ p 



where A = t~ 2 X min{Am(M 2 A) 2 /2, r/ 2 /8}. 



Note that the sign pattern of w + /iA is equal to the population sign vector s = sign(w) 
if and only if the problem in Eq. (2.2) has a solution where Aj c is equal to zero. A short 
calculation shows that this occurs if and only if the following consistency condition is satis- 
fied [32, 44, 40, 45, 37]: 

IIQjvQj>ign(w J )|| 00 ^ 1. (2.3) 

Thus, if Eq. (2.3) is satisfied strictly — which implies that we are not at a hinge point of 
Eq. (2.2) — the probability of correct sign estimation is tending to one, and to zero if Eq. (2.3) 
is not satisfied (see [40] for precise statements when there is equality). Moreover, when 
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Eq. (2.3) is satisfied strictly, Proposition 2.3 gives an upper bound on the probability of not 
selecting the correct pattern J. 

The first three regimes are not unique to low-dimensional settings; we show in Section 4 the 
corresponding proposition related to Proposition 2.3, for high-dimensional settings. However, 
the last two regimes (/i tending to n l / 2 or faster) are specific to low-dimensional 

settings. 



2.4 Medium regular izat ion 

If /m 1 / 2 is bounded from above and from below, then we show that the sign pattern of 
w agrees on J with the one of w with probability tending to one exponentially fast in n 
(Proposition 2.4), while for all sign patterns consistent on J with the one of w, the probability 
of obtaining this pattern is tending to a limit in (0, 1) (in particular strictly positive); that is, 
all sign patterns consistent with w on the relevant variables (i.e., the ones in J) are possible 
with positive probability (Proposition 2.5). The convergence of this probability follows a rate 
of n -1 / 2 (see proof in Appendix D.4 and D.5). Note the difference with earlier results [1] 
obtained for random designs. 

Proposition 2.4. Assume (Al-4) and Ji «C ^r/r- Then for any sign pattern s G { — 1,0, 1} P 
such that sj = sign(wj), there exists f(s, n 1 / 2 /!^ 1 / 2 ) G (0, 1), such that: 

, m / , ^ x ,/ i/9 i/9m 4C? E f 3 p 2 ( m(w)A 2 n 
P sign (w) = s - / sW 7 * ^ 1 \ + 2pexp K -J 

Proposition 2.5. Assume (Al-4) and jl ^ %^jt ■ Then, for any pattern s G {—1,0, 1} P 
such that sj ^ sign(wj), 



(sign(w)) = s) ^ 2pexp 



m(w)A n 

8f 2 p 



The positive real numbers Cf E and Cf E are universal constants related to multivari- 
ate Berry- Esseen inequalities (see Appendix A.l for more details). From the proof in Ap- 
pendix D.4, the constant f(s,c) has specific behaviors when c = /in 1,/2 p 1//2 is small or large: 
on the one hand, if c tends to infinity, then we tend to the bevahior of the previous section, 
that is, f(s, c) tends to one if s is the limiting pattern in Proposition 2.3 and zero otherwise. 
On the other hand, if c tends to 0, f(s, c) tends to one if s has no zeros, and zero otherwise 
(see next section). 

The last two propositions state that the relevant variables are stable, i.e., we get all relevant 
variables with probability tending to one exponentially fast, while we get exactly get all other 
patterns with probability tending to a limit strictly between zero and one. This stability of 
the relevant variables is the source of the intersection arguments outlined in Section 3. 

Note that Proposition 2.4 makes non-trivial statements only for n larger than p 4 ; the 
fourth power is due to the application of Berry-Esseen inequalities, and could be improved. 
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2.5 Low regularization 



If fx tends to zero faster than n _1//2 , then w is consistent (i.e., converges in probability to 
w) but the support of w is equal to {1, . . . ,p} with probability tending to one (the signs of 
variables in J c may then be arbitrarily negative or positive). That is, the £ 1 -norm has no 
sparsifying effect. We obtain two different bounds, with different scalings in p and n (see 
proof in Appendix D.6): 

Proposition 2.6. Assume (A.l-4:) and fx ^ ^t)t ■ Then the probability of having at least one 
zero variable is smaller than V (cf E ig, ^ + and ^ + ±ff^ + C 2 BE |J^ + 

2|J|exp(-=£^;). 

The first bound simply requires that fx tends to zero faster than n _1//2 , but the constant 
is exponential in p, while the second bound required that fx does not tend to zero too fast, 
i.e., between n~ 1 / 2 and n~ l (with constants polynomial in p). As shown in Appendix D.3, 
the two bounds correspond to two different applications of Berry-Esseen inequalities, one for 
all the possible 3 P sign patterns, one using a detailed analysis of the non-selection of a given 
variable (see Section 2.6). We are currently exploring the possibility of having a bound that 
shares the positive aspects of our two bounds — polynomial in p and without the term (fxn)~ l . 

Among the five previous regimes, the only ones with consistent estimates (in norm) and 
a sparsity-inducing effect are fx tending to zero and fin 1 ! 2 tending to a finite or infinite limit. 
When im 1 ! 2 tends to infinity, we can only hope for model consistent estimates if the consis- 
tency condition in Eq. (2.3) is satisfied. This somewhat disappointing result for the Lasso has 
led to various improvements on the Lasso to ensure model consistency even when Eq. (2.3) is 
not satisfied [40, 45, 31]. Those are based on adaptive weights based on the non regularized 
least-square estimate or two-step procedures. We propose in Section 3 alternative ways which 
are based on resampling. Before doing so, we derive in the next section finer results that 
allows to consider the presence or absence in the support set J of a specific variable without 
considering all corresponding consistent sign patterns. 



2.6 Probability of not selecting a given variable 

We can lower and upper bound the probability of not selecting a certain irrelevant variable 
in J c (see proof in Appendix D.7) — see Proposition 2.5 for a related proposition for relevant 
variables in J: 

Proposition 2.7. Assume (Al-4) and fx ^ ^rfr- Let j £ J c . We have: 



fln 1 / 2 /2\ 1 / 2 V A 2 / f 3 A x fin X^/^n 1 / 2 ' 
. fin 1 ' 2 8C 2 BE p 5 / 2 ^ BE 4r 3 p 2 



A 1 / 2 f 3 A fin X 1 / 2 n 1 / 2 ' 



S 



This novel proposition allows to consider "marginal" probabilities of selecting (or not 
selecting) a given variable, without considering all consistent sign patterns associated with 
the selection (or non-selection) of that variable). Note that it makes interesting claims only 
when jjLn 1 / 2 is bounded from above and below (for the lower bound) and when /in 1 / 2 tends to 
zero, while fin tends to infinity (for the upper bound). 



3 Support Estimation by Intersection 

The results from Section 2.4 exactly show that under suitable choices of the regularization 
parameter /i, the relevant variables are stable while the irrelevant are unstable, leading to sev- 
eral intersecting arguments to keep only the relevant variables. We first consider the irrealistic 
situation where we have multiple independent copies, then we consider splitting a dataset in 
several pieces, and we finally present two usual types of bootstrap (pairs and residuals). Note 
that an alternative approach is to resample the columns of the design matrix instead of its 
rows, i.e., draw random weights for each variable from a well-chosen distribution [33]. 

The analysis of support estimation is essentially the same for all methods and is based on 
the following argument: we consider m "replications", and J 1 , . . . , J m the associated active 
sets. The replications are assumed independent given the original data (i.e., the vector of noise 
e). We let denote J n = f)™L 1 J 1 the estimate of the active set (given the original data). Once 
the active set is found, the final estimate of w is obtained by the unregularized least-square 
estimate, restricted to the estimated active set. 

We can upper bound the probability of incorrect pattern selection as follows: 

p(j n ^j) < p(j c niV0) + P(Jn(i n ) c /0), 

(m 
U [( ry uj]/0 

^ 5^E(P(j G J*\e) m ) +mP((J*) c U J ^ 0), 

where J* denotes a generic support obtained from one replication. We now need to upper 
bound the probability P((J*) C U J 7^ 0) of forgetting at least a relevant variable j G J, and 
also the probability P(j G J*\e) that a replication does not include a given irrelevant variable 
j G J c (given the original data). The first term will always drop as the number of replications 
gets larger, while the second term increases, leading to a natural trade-off for the choice of the 
number m of replications. This is to be contrasted with usual applications of the boostrap 
where m is taken as large as computationally feasible. 



3.1 Multiple independent copies 

Let us assume for a moment that we have m independent copies of similar datasets, with 
potentially different fixed designs but same noise distribution. We then have m different active 
sets and we denote by J n the intersection of the m active sets. We have the following upper 
bound on the probability of non selecting the correct pattern (see proof in Appendix D.8) 
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Proposition 3.1. Assume (Al-4) form independent datasets with same noise distribution, 

and ft ^ ytjt ■ If c = ftn l l 2 p l l 2 > 0, ft ^ ytjt and n ^ p G g(c), then there exists /(c) > 
such that 

m(w)A 2 n \ 
8f 2 p J ' 

From the proof of Proposition 3.1 in Appendix D.8, we can get the detailed behavior of 
/(c) around c = and c = oo: it goes to zero in both cases, i.e., we actually need (in the 
bound) a regularization parameter that is proportional to n~ 1 ^ 2 . 

Moreover, we get an exponential convergence rate in n and m, where we have two parts: 
one that states that the number of copies should be as large as possible to remove irrelevant 
variables (left part), and one that states that m should not be too large, otherwise, some 
relevant variables would start to disappear (right part). Note that best scaling (for the bound) 
is m ~ n, leading to a probability of incorrect selection that goes to zero exponentially fast 
in n. 

Of course, in practice, one is not given multiple independent copies of the same datasets, 
but a single one. One strategy is to split it in different pieces, as described in Section 3.2; 
this however relies on having enough data to get a large number of pieces, which is unlikely 
to happen in practice. Our main goal is this paper is to show that by using the bootstrap, 
we can mimic the availability of having multiple copies. This will come at a price, namely an 
overall convergence rate of n~ 1//2 instead of exponential in n 

3.2 Splitting into pieces 

We can cut the dataset into m pieces of the same size, a procedure reminiscent of cross- 
validation. However, it requires extra-assumption on the design, i.e., we need to assume that 
the smallest eigenvalues of the data matrices of length n/m are still strictly positive (see proof 
in Appendix D.9): 

Proposition 3.2. Assume (Al-4) for m disjoint subdatasets of the original dataset, and 
I 1 ™pT/l A ■ If c — ftn 1 / 2 m~ 1 / 2 p 1 / 2 > 0, there exists /(c), a(c) > such that: 

* J) <P (l - e->^>% Hcf^Pjl 2 P »exp . 

The proposition above requires that m/n tends to zero, i.e., there should not be too many 
pieces (which is also required to allow invertibility of the sub-designs). Note that several 
independent partitions could be considered, and would lead to results similar to the ones for 
the bootstrap presented in the next two sections [33]. 

3.3 Random pair bootstrap 

Given the n observations (xj, jji) e MP x R, i — 1, . . . , n, put together into matrices X e lR nxp 
and y G M n , we consider m bootstrap replications of the n data points [17]; that is, for 



P( J n ^ J) < pe 



-f(c)mp 1 I 2 



+ 2pm exp 
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k = 1, . . . , m, we consider a ghost sample (x k , y k ) E W x R, i = 1, . . . , n, given by matrices 
X k E M. nxp and y k E W l . For each k E {1, . . . ,m}, the n pairs (x k ,y k ), i = 1, . . . ,n, are 
sampled uniformly and independently at random with replacement from the n original pairs 
in (X, y). Some pairs (xi, yi) are not selected, some selected once, some selected twice, and so 
on. Note that we could consider bootstrap replications with more or less points than n, but 
for simplicity, we keep it the same as the original number of data points. 

The following proposition shows that we obtain a consistent model estimate by intersect- 
ing the active sets J 1 , . . . , J m obtained from running the Lasso on each bootstrap sample 
(X 1 , y 1 ), . . . , (X m , y m ), a procedure we refer to as the Bolasso (see proof in Appendix E): 

Proposition 3.3. Assume (AA-4:). If c = p/n^p 1 ' 2 > 0, there exists strictly positive con- 
stants A , . . . , A-j that may depend on c such that if np~ 6 ^ Aq and mp~ l ^ A 7 , we have, for 
boostrapping pairs: 

1/2 \ / 3 1 \ l+^ 5 f21ogfA 3 ^L + i2^i XX 

n ' \ ( p logm\ \ \ 



P(J n ^ J) ^ mpexp^-Ao'^J +A a (a 



n 1 / 2 m 



Note that in Proposition 3.3, for any n > 0, if n and m are large enough, then we get 
an upper bound on the probability of incorrect model selection of the form Bime~ B2nl/2 + 
(~T72 + -^4^r0 T ' \ where Bi, . . . , £? 4 are positive constants. Note that in [1], we have derived 
a bound with better behavior in n, i.e., with rj = 0. However, the bound in [1] holds for 
random designs and has constants which scale exponentially in p and not polynomially. We 
are currently trying to improve on the bound in Proposition 3.3 to remove the extra factor 

7] > 0. 

As before, the number of replications should be as large as possible to remove irrelevant 
variables, and m should not be too large, otherwise, some relevant variables would start to 
disappear from the intersection. Note that best scaling (for the bound) is m ~ n 1 ^ 2 , leading 
to an overall probability of incorrect model selection that tends to zero at rate n~ 1//2 , instead 
of the exponential rate for the irrealistic situation of having multiple copies (Section 3.1). 

We have not explored yet the optimality (in the minimax sense) of the bound given in 
Proposition 3.3. While we believe that a rate of n -1 / 2 cannot be improved upon, the rate p 6 
should be improved with further research. 

Finally, we have explored in [1] the possibility of considering softer ways of performing 
the intersection, i.e., by keeping all variables that appear in a certain proportion of the active 
sets corresponding to the various replications. This is important in cases where the decay of 
the loading vectors does not have sharp threshold as assumed in most analyses (this paper 
included). However, it adds an extra hyper-parameter and the theoretical analysis of such 
schemes is out of the scope of this paper. 



3.4 Boostrapping residuals 

An alternative to resampling pairs (xi,yi) is to resample only the estimated centered residu- 
als [17, 18]. This is well adapted to fixed-design assumptions, in particular because the design 
matrix X remains the same for all replications. Note however, that the consistency of this 
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resampling scheme usually relies more heavily on the homoscedasticity assumption (A2) that 
we make in this paper [18]. Moreover, since the Lasso estimate is biased, the behavior differs 
slightly from bootstrapping pairs, as shown empirically in Section 5. 

Bootstrapping residuals works as follows; we let denote £j = ?/j — w T Xi — S{ — (w — w) T x,/ 
the vector of estimated residuals, and e$ the centered residuals equal to £j = £j — - X!fc=i ■ 
When bootstrapping residuals, for each i G {1, . . . , n}, we keep Xi unchanged and we use as 
data y* = w T Xi + eV, where i* is a random index in {1, . . . , n} — the sampling is uniform and 
the n indices are drawn independently. 

We obtain a similar bound than when bootstrapping pairs (see proof in Appendix F.2): 

Proposition 3.4. Assume (Al-4^). If c = jln l / 2 p 1 / 2 > 0, there exists strictly positive con- 
stants A , . . . , A 7 that may depend on c such that if np~ 6 ^ A$ and mp~ l ^ A 7 , we have, for 
boostrapping residuals: 



The bound in Proposition 3.4 is the same as bootstrapping pairs, but as shown in Ap- 
pendix F.2, the constants are slightly better). However, as shown in Section 5.3, the behaviors 
of the two methods differ notably: random-pair bootstrap does not lead to good selection per- 
formance in high-dimensional settings, while residual bootstrap does. While we are currently 
unable to proof the consistency of bootstrapping residuals in high-dimensional settings, we 
prove in Section 4 the model consistency of a related two-step procedure, where the bootstrap 
replications are performed within the support of the Lasso estimate on the full data. 



4 High-Dimensional Analysis 

In high-dimensional settings, i.e., when p may be larger than n, we need to change assumption 
(A4) regarding the invertibility of the empirical second order moment, which cannot hold. 
Various assumptions have been used for the Lasso, based on low correlations [27], sparse eigen- 
values [34] or more general conditions [6, 15]. In this paper, we introduce a novel assumption, 
which not only allows us to consider that the support of the Lasso estimate has a bounded 
size, but also implies that we obtain the same sign pattern with high probability. The analysis 
carried out in low- dimensional settings in Section 2.3 is thus also valid in high-dimensional 
settings. 



4.1 High-dimensional assumptions 

Our analysis relies on the analysis carried out in Section 2.3 for "high" regularization, i.e., 
when jj, tends to zero slower than n~ 1//2 . In this setting, we have shown that the Lasso estimate 
asymptotically behaves as w + fiA, where A is the unique minimum of 

min ^A T QA + A}sign( Wj ) + HA^Hl (4.1) 
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We let denote K C J c the "extended" support of a solution A JC of Eq. (4.3) and L = J fl K: 
that is, we not only keep all indices corresponding to non zero elements of Aj c , but also the 
ones for which the optimality condition in Eq. (C.l) is an equality (i.e., if we are at a hinge 
point of the regularization path, we take all involved variables) 

We consider the vector t G { — 1, 0, defined by tj = sign(wj) and t JC = sign(A JC ). If 
we assume that A min ((5L,L) > 0, then the solution to Eq. (4.1) is unique [22], and is such that 
Al = — Qll^l an d the optimality conditions for Eq. (4.1) are simply 

sign(-[Q£ 1 L t L ] K ) =t K and HQlclQE^lIIoo ^ 1. 

We make the following assumptions (note that (A6) is essentially equivalent to the lack 
of hinge point which is also made in Proposition 2.3): 

(A5) Unicity of local noiseless problem: the matrix Ql.l is invertible. 

(A6) Stability of local noiseless problem: ||Ql c lQll^lII°° < 

We let denote 

6 = min <1- ||Ql c lQll*lII°o ? min \ (Qi\^'L)kQk,k\ >, (4.2) 
^ fceK J 

the quantity that will characterize the stability of the local noiseless problem; if (A5-6) are 
satisfied, then 6 > 0. As shown in Proposition 4.1, the quantity 6 dictates the speed of 
convergence of the probability of not getting t as a sign pattern for the Lasso problem in 
Eq. (1.1) or Eq. (2.1). 

Comparison with consistency condition We now relate (A6) with the consistency con- 
dition for the Lasso in Eq. (2.3): if Eq. (2.3) satisfied, then K = and the condition (A6) 
simply becomes: 

llQj c ,J<2j,jSign(wj)||oo < 1, 
which is exactly a strict version of Eq. (2.3) — an assumption commonly made for high- 
dimensional analysis of the Lasso [44, 37]. Note that we then have the simplified expression 
0=1- ||Qj CJ (5j ! 1 J sign(wj)|| 00 . 

The main goal of this paper is to design a consistent procedure even when Eq. (2.3) is 
not satisfied. As we have seen, (A6) is weaker than the usual assumptions made for the 
Lasso consistency; in Figure 1 (left and middle), we compare empirically the two conditions 
for random i.i.d. Gaussian designs, showing that our set of assumptions is weaker, but of 
course breaks down when n is too small (too few observations) or the cardinal of J is too 
large (too many relevant variables). We are currently exploring theoretical proofs of this 
behavior, extending the current analysis of [37] for Eq. (2.3); in particular, we aim at de- 
termining the various scalings between p, n and the number of relevant variables for which 
a Gaussian ensemble leads to consistent variable selection with high probability (according 
to our assumptions which are weaker than in [37]). Moreover, in the right plot of Figure 1, 
we show values of log 6 for various n and | J | , which characterize the convergence rate of our 
bound. Relying on 6 which is bounded from below is clearly a weakness of our approach to 
high- dimensional estimation; we are currently exploring refined conditions where we relax the 
stability, i.e., we allow several (but not too many) patterns to be selected with overwhelming 
probability. 
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Figure 1: Consistency conditions for random Gaussian designs, p = 128, n from 40 to 200 
and | J | from 1 to 20 (all probabilities and averages obtained from 1000 replications). Left: 
probability that Eq. (2.3) is satisfied. Middle: probability that (A6) is satisfied. Right: 
expectation of log (plotted only for the ones for which the local problem is unique with high 
probability) . 



Checking assumptions (A5-6) In Eq. (4.1), we can optimize in closed form with respect 
to Aj as Aj = Qj j(— sign(wj) — <3j,j c Aj c ), leading to an optimization problem for A JC : 

min ^Aj c Q JC; jc|jAjc - A] c QjcjQ~ jsign(wj) + ||Ajc||i, (4.3) 

which can be solved using existing code for the Lasso. We are currently working on deriving 
sufficient conditions which do not depend on the sign pattern of the population loading w 
(but only on the sparsity pattern, or even its cardinality), as usually done for the consistency 
condition in Eq. (2.3) [44, 40]. 



4.2 Stability of sign selection 

With assumptions (A5) and (A6), we can show that with high-probability, when the regu- 
larization parameter is asymptotically greater than n _1//2 , then the sign of the Lasso estimate 
is exactly t (see proof in Appendix G): 

Proposition 4.1. Assume (Al-3), (A5-6), and p, ^ ^gffi . Then: 

na 2 2 \ L \ .,, ( nm(w) 2 A?\ 
- g f2|L| 1 + 2| J| exp I . (4.4) 

Note that if 6 is bounded away from zero, then we simply need that \ogp = o(n) for our 
result to hold. Moreover, in Eq. (4.4), we can see that dictates the asymptotic behavior of 
our bound. If it is too small, then in order to have a meaningful bound for this design matrix, 
we would need to consider sign patterns which are close to t and show that the sign pattern 
of the Lasso estimate w is with high probability within these sign patterns. 
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4.3 High-dimensional Bolasso 



Proposition 4.1 suggests to run the Lasso once with a larger regularization parameter (i.e., 
multiplied by logp) and run the various resampling schemes within the active set of the 
original Lasso estimation (which is very likely to be the support associated with t). More 
precisely, we have the proposition (see proof in Appendix G): 

Proposition 4.2. Assume (Al-3) and (A5-6). If c = fln 1 ^ 2 ]!^] 1 ^ 2 > 0, there exists strictly 
positive constants Aq, . . . , Af that may depend on c such that ifn\Li\~ 6 ^ A$ andm\L\~ l ^ Aj, 
we have, for boostrapping residuals: 

c 2 (logp) 2 2 A L \ 1T1 / nm(w) 2 A 2 ^ 
~ g~ 2 | L | 2 + 21 Jl exp [-^4t 



L |l/2 / M ^1/2 



n 1/2 \ A ( A |L| 3 logm^ 1+ ^l 21og l A3 ^ + 
mpexp ( -Aqj—^ ) + A A \ A 3 



log m 



-1/2 



Note that the constants depend polynomially on |L| and A m i n (QL,L), and do not depend 
on p. This is thus a high-dimensional result where p may grow large compared to n. If we 
relax (A6), then the original Lasso estimate would have a small set of allowed patterns with 
high probability (instead of simply one), and a union bound considering all those would need 
be considered. 



5 Algorithms and Simulations 

In this section, we describe efficient algorithms for the boostrapped versions of the Lasso that 
we present in this paper and we illustrate the various consistency results obtained in previous 
sections, in low-dimensional and high- dimensional settings. 

5.1 Efficient Path Algorithms 

We first consider efficient algorithms for the boostrapping procedures, based on homotopy 
methods [35, 16, 23]. Similar developments could be made for first-order methods [20, 19]. 
For the regular Lasso, one can find the solutions of Eq. (1.1) for all values of the regularization 
parameter fi that correspond to less than k selected covariates in time which is empirically 
0(pn + k 2 n): indeed, computing -X T y once is 0(pn), while computing the relevant elements 
of Q = ^X T X and updating various quantities is 0{k 2 n). Note that our analysis suggests to 
stop the path when the solution of the problem is not unique anymore, i.e., when the design 
matrix of selected variables become rank-deficient. 



Bootstrapping pairs When bootstrapping pairs, we require m applications of the regular 
Lasso procedure with different design matrices, so we get a complexity of Oimpn + mk 2 n), 
and since the designs are different, there is no immediate possibility of sharing computations 
between different bootstrap replications 
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Figure 2: Probability of selecting each variable vs. regularization parameter fi (low- 
dimensional setting) for various resampling schemes, before intersecting. White values corre- 
spond to probability equal to one, and black values correspond to probability equal to zero 
(model consistency corresponds to while on the top 8 variables and black on the rest). Top: 
consistency condition of the Lasso is satisfied, Bottom: consistency condition not satistfied. 
Note the similar behavior of resampling noise (which requires knowing the generating distri- 
bution) and the two forms of bootstrapping (which do not). See text for details. 

Bootstrapping residuals When bootstrapping residuals, we first run the Lasso once, with 
complexity 0(pn+k 2 n). Then, for all values of the regularization parameter, naively, we would 
have to run the Lasso m times. In order to avoid running the Lasso as many times as m times 
the number of values of /i we want to consider, one can first notice that there are at most 
0(k) break points in the original Lasso estimation, and that between break points, one has 
to minimize an objective function which is composed of a ^-penalty, a quadratic term and a 
linear term whose coefficients depend affinely in \i. This implies that the path is also piecewise 
linear within this segment and can be followed using an homotopy algorithm very similar to 
the one for the regular Lasso. Thus it makes 0(mpn + mk 2 n) per segments when restarting 
an homotopy method for this segment, i.e., an overall complexity of 0(mkpn + mk 3 n). This 
can be put down by computing a joint path that goes through all 0(k) segments sequentially 
instead of in parallel, in total time 0(mkpn + mk 2 n). Moreover, since when bootstrapping 
residuals, the design matrix is the same for all replications and computations of submatrices 
of Q may be cached, to obtain a complexity of 0(mkpn + k 2 n). 

Similarly, when bootstrapping after projections onto the active set of a single global Lasso 
run, one can get even get a lower complexity of 0(pn + mk 2 n), i.e., one Lasso followed by m 
Lasso on a reduced data set. This requires however updates (when the first Lasso estimation 
switches active sets) such as the ones proposed in [23]. 
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5.2 Experiments - Low-Dimensional Settings 

We first consider a low- dimensional design matrix, with p — 16, n — 1024 and 8 relevant 
variables (i.e., J = {1,...,8}). The design is sampled from a normal distribution with 
independent rows, sampled i.i.d. from a fixed covariance matrix. We consider two covariance 
matrices, one that leads to design matrices which do not satisfy the consistency condition of 
the Lasso in Eq. (2.3), and one that leads to Lasso-consistent design matrices. 

In Figure 2, we plot the marginal probabilities (computed from 512 independent repli- 
cations) of selecting any given of the p = 16 variables for all values of the regularization 
parameter fi and for the various resampling schemes (resampling noise, bootstrapping pairs 
or bootstrapping residuals), without intersecting (i.e., we are just reporting counts from 512 
replications from a single dataset). Note that the left column (resampling noise) exactly corre- 
sponds to the various regimes of the Lasso presented in Section 2 (these require full knowledge 
of the generating distributions and are only displayed for illustration purposes): for large val- 
ues of /i, no variable is selected (Proposition 2.1), then a fixed pattern is selected (/i tending 
to zero faster than n _1//2 , Proposition 2.2), then all patterns including the relevant variables 
(// of order n _1//2 , Propositions 2.4 and 2.5), and finally, for small values of /i, all variables 
are selected (Proposition 2.6). Note that in the top plots, as expected (since Eq. (2.3) is not 
satisfied), some portions of the regularization paths lead to the correct pattern, while in the 
bottom plots, as expected (since Eq. (2.3) is satisfied), there is no consistent model selection. 
It is important to note that using the bootstrap leads to similar behavior than resampling the 
noise, but does not require extra knowledge (i.e., a single dataset is needed). Note finally, that 
bootstrapping residuals does alter slightly the regularization paths — because of the bias of 
the Lasso estimate — and the selected patterns (see other evidence of this behavior in Figure 3 
and Figure 4). 

In Figure 3, we compute the marginal probability of selecting the variables for the Lasso 
(left column) and the various ways of using the Bolasso (boostrapping pairs or residuals), i.e., 
after intersecting. Those are obtained by running the Bolasso with 512 replications, 128 times 
on the same design but with different noisy observations (thus, a total of 512 x 128 Lasso runs 
are used for each of the plots on the middle and right columns of Figure 3). On the top plots, 
the Lasso consistency condition in Eq. (2.3) is satisfied and the two versions of the Bolasso 
increase the width of the consistency region of the Lasso, while on the bottom plots, it is not, 
and the Bolasso creates a consistency region. Note that bootstrapping residuals modifies the 
early parts of the regularization path (i.e., large values of /i), illustrating the effect of the bias 
of the Lasso when bootstrapping residuals. 

In Figure 4, we consider the effect of the number m of bootstrap replications, in the same 
two situations. Increasing m seems always beneficial. Note that (1) when m = 1 (essentially 
the Lasso), we get some strictly positive probabilities of good pattern selection even in the 
inconsistent case, illustrating Proposition 2.4, and (2) if m was too large, some of the relevant 
variables would start to leave the intersection of active sets (but this has not happened in our 
simulations with only 512 replications). 
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Figure 3: Probability of selecting each variable vs. regularization parameter /i (low- 
dimensional setting) for the Lasso (left column) and the Bolasso (middle and right columns). 
White values correspond to probability equal to one, and black values correspond to probabil- 
ity equal to zero (model consistency corresponds to while on the top 8 variables and black on 
the rest). Top: consistency condition of the Lasso is satisfied, Bottom: consistency condition 
not satistfied. See text for details. 
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Figure 4: Probability of correct pattern selection with various numbers m of repli- 
cations in {1 in plain black, 2, 4, 8, 16, 32, 64, 128, 256, all in dashed red, 512 in plain blue} 
(low-dimensional setting). Top: consistency condition of the Lasso is satisfied, Bottom: con- 
sistency condition not satistfied. Note that only one replication (plain black) is very similar 
to the regular Lasso. 
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Figure 5: Probability of selecting each variable vs. regularization parameter /i (high- 
dimensional setting) for various resampling schemes, before intersecting. Only the first 8 
variables and the 16 variables which violates condition in Eq. (2.3) the most are plotted. 
White values correspond to probability equal to one, and black values correspond to proba- 
bility equal to zero (model consistency corresponds to while on the top 8 variables and black 
on the rest). Note the similar behavior of resampling noise (which requires knowing the gen- 
erating distribution) and all forms of bootstrapping (except for bootsrapping pairs, in the 
top-middle plot). 
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5.3 Experiments - High-Dimensional Settings 

We now consider a "high- dimensional" design matrix (i.e. such that p > n), with p = 128, 
n = 64 and 8 relevant variables (i.e., J = {1, . . . ,8}). The design matrix is sampled from 
a normal distribution with i.i.d. elements. For the sampled design matrix, the condition in 
Eq. (2.3) is not satisfied, as for most designs with such p, n and |J|, as shown in Figure 1 in 
Section 4, but assumptions (A5-6) are. 

We performed the same simulations than in Section 5.2, with additional bootstrapping 
procedures, namely after projecting into the original Lasso estimate, with the same regular- 
ization parameter (no consistency result) or with a parameter multiplied by logp (consistency 
result in Proposition 4.2). 

In Figure 5, we consider marginal probabilities before intersection, to study the general 
behavior of various resampling schemes. We see that bootstrapping procedures behave rather 
differently than resampling the noise (unlike in low-dimensional settings), and that boost- 
rapping pairs does lose some of the relevant variables while boostrapping residuals does not. 
After projection, all resampling procedures behave correctly. In Figure 6, we compare the 
Lasso and the Bolasso (for several ways of performing the bootstrap): boostrapping residuals 
consistently leads to better performance. Note that while the top right plot behaves correctly, 
we currently have no proofs for it. In Figure 7, we consider the effect of various numbers of 
replications. Note that in the bottom-right plot, 512 replications are indeed too many (i.e., 
when too many replications are used, we start to lose some of the relevant variables). 

6 Conclusion 

We have presented a detailed analysis of the variable selection properties of a boostrapped 
version of the Lasso. The model estimation procedure, referred to as the Bolasso, is provably 
consistent under general assumptions, in low- dimensional and high-dimensional settings. We 
have considered the two types of bootstrap for linear regression, and have shown empirically 
and theoretically better properties for the bootstrap of residuals. This work brings to light 
that poor variable selection results of the Lasso may be easily enhanced thanks to a simple 
parameter-free resampling procedure. Our contribution also suggests that the use of bootstrap 
samples by L. Breiman in Bagging/Arcing/Random Forests [10] may have been so far slightly 
overlooked and considered a minor feature, while using boostrap samples may actually be 
a key computational feature in such algorithms for good model selection performances, and 
eventually good prediction performances on real datasets. 

The current work could be extended in various ways: first, we have not proved yet that 
bootstrapping residuals, while giving nice empirical performance, is consistent in terms of 
model selection. Second, a similar analysis could be applied to other settings than least-square 
regression with the £ 1 -norm, namely regularization by block £ 1 -norms [39], multiple kernel 
learning [39], more general hierarchical norms [43, 3], and other losses such as general convex 
classification losses; in particular, an extension of our results to well-specified generalized 
linear models is straightforward, as they are locally equivalent to a problem like in Eq. (2.1), 
i.e., locally they are equivalent to minimizing \{w — w) T Q(w — w) — q 1 (w — w) + /i||u>||i, 
with q being random and having as covariance matrix a multiple of Q. 
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Figure 6: Probability of selecting each variable vs. regularization parameter fi (high- 
dimensional setting) for the Lasso (left column) and the Bolasso (middle and right columns). 
White values correspond to probability equal to one, and black values correspond to proba- 
bility equal to zero (model consistency corresponds to while on the top 8 variables and black 
on the rest). See text for details. 
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Figure 7: Probability of correct pattern selection with various numbers m of repli- 
cations in {1 in plain black, 2, 4, 8, 16, 32, 64, 128, 256, all in dashed red, 512 in plain blue} 
(high-dimensional setting). Top: consistency condition of the Lasso is satisfied, Bottom: 
consistency condition not satistfied. Note that only one replication (plain black) is very 
similar to the regular Lasso. 
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Moreover, extensions to general misspecified models or models with heteroscedastic ad- 
ditive noise could be carried through. Also, theoretical and practical connections could be 
made with other work on resampling methods and boosting [11]. In particular, using the 
bootstrap to both select the model and estimate the regular izat ion parameter is clearly of 
interest. Finally, applications of such resampling techniques for signal processing and com- 
pressed sensing [4, 13] remain to be explored, both in the context of basis pursuit (£ 1 -norm 
regularization, [14]) and matching pursuit (greedy selection, [28]). 



A Probability results 

In this appendix, we review concentration inequalities that we will need throughout the proofs. 



A.l Multivariate Berry-Esseen Inequalities 

If Xi, . . . , X n e MP are n independent (but not indentically distributed) random vectors, with 
finite third-order moments, and normalized second-order moments, i.e., such that var(n -1 / 2 
I, then for all convex sets C, we have the multivariate Berry-Esseen inequality [5, 24]: 

where u is a standard normal random vector and Cf E is a universal constant. 

We can also derive from [24] another version for expectation of bounded Lipschitz func- 
tions, i.e, if f(x) is bounded by Mi and Lipschitz, with Lipschitz constant M 2 , then, we 
have: 




(A.2) 



where Cf E is a universal constant. Note that better bounds (with better scalings in p) 
exist in the i.i.d. case [5]. Any improvement on Berry-Esseen inequalities would lead to an 
improvement of our results. 

In this paper, we will consider convex sets corresponding to selecting a given sign pattern 
(among the 3 P available ones), making use of Eq. (A.l). When considering leaving out a 
given variable (like in Appendix D.7), we will design a specific Lipschitz function and apply 
Eq. (A.2). 




A.2 Concentration Inequalities for Subgaussian Variables 

We consider n independent real random variables Y±, . . . ,Y n , which are subgaussian with zero 
mean and uniform subgaussian constant, i.e., there exists r > such that for alH G {1, . . . , n} 
and all seR, E(e sY >) sC e^l 2 . Then, we have [30, 8]: 



P 



-J^Y^t] ^e- nt2 ' 2T \ (A.3) 



n . 
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Note that the variance of Y is always then less than r 2 (with equality if and only if Y is 
normally distributed). We will also use Hoeffding inequality for bounded variables, which 
amounts to use the fact that if \Y\ ^ M, then Y is subgaussian with constant r 2 = M 2 /4 [8, 
30]. 

We will also use concentration inequalities for quadratic forms [38] in independent random 
subgaussian variables, with universal strictly positive constants C% , C%: for all symmetric 
matrices A, if \A\ denotes the matrix of absolute values of elements of A, then 

/ r/yu -2 fq+2 - 4 )\ 
¥(Y T AY-E(Y T AY) > t) ^C x q exp(-min|gp jj . (A.4) 

B Perturbation of positive matrices 

In this appendix, we review known results of perturbation of positive matrices. Let Q and 
R be two positive matrices of size p, A and B two disjoint subsets of {1, . . . ,p} such that 
AUB = {l,...,p}. We have [25]: 

Amin 

iini/2 pi/211 ^ max{A max (Q), Amax(-R)} 1 / 2 
2max{A min (Q), A min (i?)} 

_ ir-V^Ha < 1 — - ||Q-^|| 2 , 

2max{A min (Q), A mill (.R)} d / i 

lin n- 1 r R~ l II ^ H^A^zjd^llj i ^ x (Ra,a) 1/2 \ ln R n 

II^B^gB-^B-rtBBl^ ^ r 7-^ r h~ 1~ \\Qb,B ~ H B ,B || S - 

C Optimization lemmas 

The following three lemmas give error bounds on the Lasso estimates and conditions for a 
sign pattern s G {—1, 0, 1} P to be the one of the unique solution w to Eq. (1.1) or Eq. (2.1). 

Lemma C.l. Assume (Al) and (A4). Let s G {0,-1, 1} P and J = {j,sj ^ 0}. Then s is 
selected (i.e., sign(u)) = s) if and only if: 

WQj^jQjjQj ~ Qjc ~ Qjc,jcWjc - fjQjc tJ QjjSj ||oc < /i, (C.l) 
sign(wj + Qj^qj - hQj^jSj) = sj. (C.2) 

The solution then satisfies wj = wj + Qj]j(qj — /xsj). 

Proof. Following standard results in non-smooth convex optimization [9, 7], w is optimal 
for Eq. (1.1) or Eq. (2.1), if and only if, for all j G {l,...,p} such that Wj ^ 0, then, 
[Q(w — w)]j — qj + yusign(w;j) = 0, and for all other j, \[Q(w — w)]j — qj\ ^ /x. We thus get 
wj = wj + Qj]j(qj — fisj), and the result follows from expressing that wj should have the 
right sign on J — Eq. (C.2) — and that the directional derivatives along other directions are 
positive— Eq. (C.l). □ 
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When the sign pattern is consistent on J with w, then we can further refine the conditions 
of Lemma C.l: 

Lemma C.2. Assume (AX) and (A4). Let s £ {0,-1, 1} P such that sj = sign(wj) and let 
J = {ji Sj ^ 0} D 3 . Then s is selected if and only if: 

\\Qj c ,jQj!jQJ - <?J C - l*Qj°,jQjj8j Hoc ^ A*, (C.3) 
sign(wj + (Qj^qj - vQj^jSj)j) = sign(wj), (C.4) 
sign(Q7 i 1 J gj - hQjjSj)j\j = s A j. (C.5) 

The solution then satisfies wj = wj + Qjj(qj — fJ-sj). 

Lemma C.3. Assume (Al) and (A4). We have ||u>-w|| 2 < ^feffi and IIQ 1/2 (^-w) || 2 < 

up ■ 

Proof. From optimality conditions, we have — w) — qW^ «C /i, from which we get 

\\Q 1/2 (w-w)\\ 2 ^ \mm(Q)~ 1/2 {\\Q(w - w) - q\\ 2 + ||g|| 2 ). The results follow from the identity 
IMh ^ p 1 / 2 1| a || oo for any a £ W. □ 

The following lemma relates the solutions of Eq. (2.1) for different values of Q and q. This 
will be used in Appendix D.7 to prove the Lipschitz continuity of the solution of Eq. (2.1) as 
a function of q. 

Lemma C.4. If w is solution of Eq. (2.1) for Q,q, and w' is solution for Q',q', then we 
have: 

\\QW{w - w')\\^2\\Q-\q - q>)\\ 2 + 2 " ^~ ^'^'^ " 2 \Wh] ■ 

Let 7 = Q^-™-Q~\) ! then if Q = Q> ; || 7 _ y|| 2 ^ W-Hji-rth f and if q = q > } then 

Hfv-iAv rn'W2VII < 2 ll(Q , )" 1/2 (Q-Q , )Q" 1/2 lla [-1/2.. , 1 
HQ ' ^ II2 ^ ^u-^^a— (QOV^ ^ M + 119 II2J ■ 

Proof. We let denote </(u>) = |(u> — w) T Q(tf — w) — g T (u; — w) + //||io||i the Lasso cost 
function. A short calculation shows that for all z such that z T Qz = 1, J(w' + az) — J(w') is 
larger than 

Y - (ll<T 1/2 (<? - ?0II 2 + IKQT'V - w)|| 2 ||(go- 1/2 (Q - Q')Q- 1/2 h) a. 

If the last expression is nonnegative, since J is convex, the (unique, because Q is invertible) 
minimum w of J must occur within the convex set {w, \\Q 1 ^ 2 (w — w') || 2 =Ss «}• The first 
result follows, using Lemma C.3. Other results are direct consequences of using results from 
Appendix B. □ 
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D Proofs for low-dimensional results 

Note that assumption (A3) implies a bound on the largest eigenvalue of the matrix Q, i.e., 
A m ax(<5) ^ PlIQIloo ^ pM 2 . Moreover, we have for all J C {l,...,p}, k G {l,...,n} and 
j G J c : 

„ „ i w M i T , 1/2w 2Af|J| 1 / 2 



which leads to 



A^Q) 1 ^ 1 1 ^ A 1 / 2 



P(||9J" - Qj^jQjjQjWoo > tMa) ^ 2pexp ] • (D.l) 

D.l Proof of Proposition 2.1 

The null vector w = is solution of Eq. (2.1), if and only if HQw + gHoo ^ /i, which is the case, 
as soon as fi ^ M 2 ||w||i + ||g||oo (because HQwjloo ^ M 2 ||w||i), and, thus with the additional 
assumption fi ^ 2M 2 ||w||i, as soon as ||<z||oo ^ m/2. We have, by the union bound: 

p / ~2\ 

g||oo < n/2) > 1 - |>(|<&| > V/ 2 ) > 1 " 2pexp -|U , 
j=l v / 

because we have E(e SXijei ) ^ e s 2 r 2 M 2 /2 f or a n s e j£ anc i j g . . , an d by Eq. (A. 3). 
D.2 Proof of Proposition 2.2 

If J(w) = \{w — w) T Q(w — w) — q J (w — w) + /i||u>||i is the Lasso cost function, we have, for 
all z G MP such that ||z||2 = 1 and a > 0, 

J(w + az) ^ J(w ) + X min (Q)a 2 /2 - q T az + ((j,-(i )(\\w +az\\i- ||wo||i), 
^ J(w Q ) + A min (Q)a 2 /2 - a(||g|| 2 + |/i - /i |p 1/2 ), 

which implies — Wolh ^ 2A min ((5)~ 1 ||g||2 + 2|/x — /io|A m in( ( 5)~ 1 J' 1 ^ 2 - The first inequality 
follows from P(||g|| 2 ^ t) < 2pexp(-t 2 n/2pM 2 r 2 ), applied with t = A min (Q)/3(x/4M. 

We let denote s and J the sign and support patterns of v. We have from optimality 
conditions of the noiseless problem, (u>o)j = wj- /ioQj jSj and ||(<5(wo — w ))j c ||oo ^ fio—r]Ma. 
We now need sufficient conditions for Eq. (C.l) and Eq. (C.2) in Lemma C.l. For Eq. (C.2), we 

need that sign(K) J + g7>+( / , -/i)Q7,>j) = aj. If \^-fi \ < ^^°) = ^MW , 
then 

|| (/x - aOQ7,j-sj||oo ^ |/x - Aio|A mi n(Qry /2 < m(w )/2, 
and then Eq. (C.2) is satisfied as soon as {Qj]jqj)jSj ^ — m(u>o)/2, for all j G J, which occurs 
with probability greater than 1 — p exp ( ~ nm (^oM/o-) a 
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For Eq. (C.l), we assume that \\qjc — Qj^ t jQjjqj\\oo ^ r]M<j/2, which occurs with proba- 
bility obtained from Eq. (D.l) (with t = rj/2). Also, |^-^| ^ < ^ f^jl* < 

P Vn(Q) 1/2 

Mcrri/2 mi- • i- 

v-^-t? — . lhis implies 

\\qjc ~ Qjc,jQjjqj\\ 00 < vMa - |/x - /x |(l + p^M^X^Q)- 1 ' 2 ), 

\\qj° ~ Qj°,jQjjqj\\oo + IIQK - w)||oo + \n- fi \p 1/2 M^ 2 X min (Q)~ 1/2 < ^ 
because ||Qj jSj||oo ^ -^p 1 ' 2 A m i n ((3)~ 1/ ' 2 ; hence the desired result. 

D.3 Proof of Proposition 2.3 

Note that fi < m(w ^7 2 in(Q) , and || A|| 2 ^ A min (Q) - V /2 implies that sign(wj+/iAj) = sign(wj). 
Thus, if ||z|| 2 = 1 and a > 0, we have: 

J(w + //A + az) ^ J(w + /jA) + X mhl (Q)a 2 /2 - q T az + (fiA) T Qaz + 

A*(||w + /xA + a^Hi - ||w + /uA||i), 
^ J(w + //A) + A min (Q)a 2 /2 - g T az, 

which implies \\w — w — yuA|| 2 ^ 2A m i n (<5) _1 ||<2i|2, ans thus the first inequality. 

We let denote s the sign pattern of A and J its support. Since, by assumption \x ^ 2Y/2 , 
we have /-t||<3j}sj)j||oo ^ m(w)/2, if || (<3j j<7j)j||2 ^ | m ( w ), which occurs with probability 
greater than 1 — 2|J| exp(— n m ^ f \ X p n ), then Eq. (C.4) is satisfied. 

If ||gj c — QjcjQj^qjWoo ^ A 47 ?, then Eq. (C.3) is satisfied, and this occurs with prob- 
ability greater than 1 — 2pexp ^— ^pj-rj (from Eq. (D.l)). Finally, if for all j G J\J, 
(Qjjlj)j s j ^ "A 4 1 A? I' then Eq. (C.5) follows. This occurs with probability greater than 
1 — pexp(— X 2 m(M 2 A) 2 fi 2 n/2f 2 p). The result follows by the union bound. 

D.4 Proof of Proposition 2.4 

The optimality condition in Eq. (C.4) from Lemma C.2 is satisfied as long as p ^ ^r/r ? an d 

\\{Qj jQj)j\\2 ^ | m ( w )j which occurs with probability greater than 1— 2|J| exp(— nm(w)A 2 n/8f 2 p), 
while the intersection of events in Eq. (C.3) and Eq. (C.5), by the Berry-Esseen inequalities, 
converges to the probability that P(it G C), where u is normal with zero mean and covariance 
matrix Q and C is the convex set defined as the intersection of 

{ [(Q7>j) AJ - fin^a^iQj^s)^} o s AJ > 0} 

and 

{\\ujc - QjcjQjjUj - pn xl2 a~ x Q jcjQjjS j || oo ^ pn^a' 1 } . 
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The set C and its complement have non-empty interior and since Q is full-rank, the probability 
is strictly inside the interval (0, 1). Moreover, by Eq. (A.l), the error bound is upperbounded 
by Cf E times 



p 1 ' 2 

n l/2 



p 1/2 4Mp 1/2 r 3 p 2 4f 3 

because E|ej| 3 ^ 4r 3 , which leads to the desired result. 
D.5 Proof of Proposition 2.5 

We simply use Lemma C.3: — w|| 2 ^ p [ m *q^ 2 ■ Thus if m(w) > p ^ ^ + ^ 2 , the result 
follows from concentration inequalities in Appendix A. 

D.6 Proof of Proposition 2.6 

We have, by considering all patterns consistent with the total absence of zeros: 
P(3j G {1, . . .,p},Wj = 0) < ^ P(sign(u>) = s). 

s,3iG{l,...,p},^=0 

We now consider such a pattern and its support (strictly included in {l,...,p}). From 
optimality conditions in Eq. (C.l), we get that sign(w)) = s implies that \\qj — Qj,jQj]jqj — 
nQjijQj j s j||oo ^ fJ> for some j E J c ^ 0. Note that the covariance matrix of qj — Qj,jQjjqj 
is equal to a 2 Qjj\j/n and has a lowest eigenvalue greater than cr 2 \ m m(Q)/n. Thus, by the 
Berry-Esseen inequality, 

P(sign(^) = 8 )^ cf E i^4n + ~^r, 

which implies the desired result, since there are at most 3 P allowed patterns. 

We can get a better bound (with respect to p), but with a weaker dependence in /i, that 
is, we consider: 

F{3j e {1, . . . ,p}, Wj = 0) < P(3j G J, ^ = 0) + P(3j G J c , Wj = 0). 

The first term is upper bounded by 2| J| exp(— nm(w)A 2 ?7,/8f 2 p), while the second one is 
upper-bounded using Proposition 2.7. This leads to the global desired upper bound, which 
scales better in p but worse in n. In particular, it requires that \in tends to infinity, i.e., \i is 
not too small. 
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D.7 Proof of Proposition 2.7 

We first start with a simple elementary lemma: 

Lemma D.l. Ifu^Wisa standard normal random variable, then 

a > ¥(\u - /3\ ^ a) ^ -^— e -P 2 / 2 . 

1 + a 

When minimizing Eq. (2.1), with the constraint that Wj = 0, we get the solution (with 
the notation j c = {j} c ): 



Wjc 



for a certain Tigfe' ) such that IIQ^jctL q(?j c )IIoo ^ 1- It is optimal for the full problem if 
and only if (because Wj = 0), \Qjjc(wjc — w jC ) — qj\ ^ /i, i.e., 

\-Qj + Qj-j Qy \y <iy + vQjjcQj-jh^Qiqj^l ^ 

We consider the "soft indicator" function (triangle-shaped) f^giq) of q defined as 



The function f 3 q is upper bounded by 1, moreover, from Lemma C.4, 7^q is Lipschitz with 
" /i~ 1 A (Q)i/2 • Thus, f^Q is Lipshitz with constant 2//~ 1 A M ^qy/2 + -^-W ^ 



-1 M 



5/1 A min (Q)V2 • 

Moreover (by design) we have 



thus E/^g(g) ^ P(j ^ J). This implies by the Berry-Esseen bound (see Appendix A.l), that, 
if qc denotes the Gaussian approximation: 

BE p 1/2 [^P 1/2 n-^ 2 \ 4f 3 p 3/2 



^v 1 ' 2 fbv 1 ' 2 ^ 1 ' 2 \ 



A 1 / 2 



because the average third order moment of the normalized variable is equal to 4T -^/i and 



the Lipshitz constant of the function of the normalized variable is equal to 

n -l/2 ap V2 M = 

Moreover, we can lower bound, for any q, 

E/£g(g) > - '/, + Q,, O.y \y <ij + ^QJ^Q^ < /' 2 )" 

When applied to the Gaussian limiting distribution q&, we know that the random variable 
n 1 / 2 cr _1 (— qj + QjjcQjc^jcqjc) is asymptotically normal with mean zero and covariance n 2 = 
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Qj,j\j c - We get by applying Lemma D.l with (3 = MW 1 ° — Qj,j c Qj^1'l,Q( ( lj c ) an d a = M " 1 ° — 
which are such that 1/3 1 < ^ 1/2 ^ W /2 : 



E[/ig ((to) I 



2k 



cxp 



MA min (g)-V 



which leads to, with k ^ M and /c ^ A m i n (Q) 



1/2 



E/ig(?G) ^ 



2M 



1/2,-1 O^P 



2A min (Q) 1 /2 



fj, 2 np M 2 

' 2a 2 A min (g) 2 J • 



Similarly, we can get an upper bound on the probability of not selecting the variable j. 
We consider the same technique, but we now need to upperbound a probability of the type 
E/^g(g G ), which leads to the desired result. 



D.8 Proof of Proposition 3.1 

Following the analysis in Section 3, we need to upper bound P(j G J*) m and P((J*) C U J ^ 0). 



WeobtainP((J*) c UJ ^ 0) ^ 2pexp 
we get 



m(w)A J 



n ) from Proposition 2.5. From Proposition 2.7, 



P(j e J*) ^ 



/in 1 / 2 /4 



/i 2 

exp I — n» 

1 + /271V2/2AV2 F V 2A 2 



10C 2 BE 



r 3A x /inp 



1/2 



4C BE-3 p 5/2 

AV2 



We let h(c) 



1 c/4 



-fH> and #(c) 



2 1+C/2A 1 / 2 

log(l — h(c)), to get the desired result. 



8Cf E i 
f3\i c 



ACS 



h(c) 2 , and /(c) 



D.9 Proof of Proposition 3.2 



10C? t i . AC. 



Using the same reasoning as in Appendix D.8, we get the same f(c) and a(c) = -^fi \ 



2 



3 A 1 C 1 f3 A l/2' 



E Proofs for boostrapping pairs 
E.l Concentration inequalities 

We now assume that we have a bootstrap sample X* and y*, which leads to Q* and q*. We 
now derive concentration inequalities for q* and Q*, that we use in Appendix E.2. 

For all a, b G {1, . . . ,p}, Q* ab is an average of variables bounded by M 2 . Thus, by Hoeffd- 
ing's inequality [8] and the union bound: 

m\Q* - Q\\oo>tM 2 ) «C 2p 2 exp (-2nt 2 ) . (E.l) 
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Similarly, we bound the deviation between q and q*: 



q ~ q*\\oo > tMa\e) sC 2pexp ( -2nt 2 -^— ) . (E.2) 

If II oo , 



Also, by the central limit theorem, given e, n 1//2 (g* — q) converges in distribution to a normal 
variable with mean zero and covariance matrix 



1 n 

a 2 Q = E [( £ *) 2 a;*(a;*) T |£] - E[e*x*|e]E[e*a;*|e] T = -^e 2 x iX J - gg T 



i=l 



We can derive concentration inequalities of Q around Q, by using Appendix A. 2 and Eq. (A. 4): 
Lemma E.l. Assume (Al-4). We have: 

^tn CM 2 )\ ( nt 



n\\Q-Q\\oo> 



tM 2 ) <2p 2 C , 1 q expf-minj^-, \\ +2pexp 



2f 2 



Proof. From Eq. (A. 4) applied with a diagonal matrix for each pair of coordinates a, b (and 
using the union bound): 



P 



^ iX>- 2 e 2 x 4 x7-Q ^tM 2 j^2p 2 C'^xp^-min|^ 



r 2 ' f 4 



If we use the inequality P(||g||oo > z) ^ 2pexp(-nz 2 /2M 2 t 2 ), with 2 = (t/2)^ 2 <rM } we 
get the desired result. □ 

E.2 Proof of Proposition 3.3 

Following the analysis from Section 3, we need to upper bound P(j G J*\s) (probability of 
including a certain irrelevant variable into one of the replicated active sets), and P((J*) C U J 7^ 
0) (probability of missing none of the relevant variables). We first prove two lemmas about 
each of them. 

Lemma E.2. Assume (Al-4), h ^ and 2 > 256 ;' . We /iawe: 

' /> r< ^ 2pV2 p ^ m(w) 2 A 2 

m// » , , s 9 I \ 2 n\ 1/9 / m(w)An 1//2 

P((J*) C U J ^ 0) < 2p 2 exp ( - T -J + Spn^exp I -_ LJ--^ 

Proof. This lemma shows that all relevant variables will be selected with overwhelming 
probability. From Lemma C.3, we have that J C J* as soon as \\w — wlU ^ p . ^l^J 2 ■ 

Thus, if m(w) > ^gy, A min (Q*) ^ A min (Q)/2, ||g - g*|| 2 sC m(w)A min (Q)/8, and ||g|| 3 < 
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m(w)A min ((5)/8, then J C J*. Thus, we have: 

p((j*) c u j + 0) < p ( A min (g*) < +P ( \\ qh ^ m ( w ) A m in(g) 



+p(lk-gl2^ m(w)Amin(Q) 



2p 2 ex ) + 2p exp 

+2pE exp 



n\ 2 \ ( nm(w) 2 A 2 



2p 2 ) ' r ~~ r \ 128pf 2 

nm(w) 2 X 2 a 2 



32p\\e 



I 2 

CO 



We thus need to bound, for some A > 0, 

Eex P (-pfe) = Eexp (-pfc) llNU^ + Eexp ( -±) 1 Moo> m, 

< exp(-^) +P(||e|| 00 > M), 

< exp + 2nexp(-f$) ^ 3^ exp(-^), 

for A/M 2 = A 1 / 2 /r2 1 / 2 + log(n 1/2 ), leading to 

^ / nm(w) 2 A 2 a 2 \ 1/9 / n 1 / 2 m(w)A 

2pEexp (—^Bir J * 6pn exp i"^ 2 - 

The condition n > 256 y„f„ allows to combine two terms into one, leading to the desired 
result. □ 

Lemma E.3. Assume (Al-4) and j G J c . Moreover, assume that \\q\\oo ^ ftM<r/2 and 
\\Q - Q\\oo ^ with ft ^ /i, ftft < and ft ^ A. We have: 



/in 1 / 2 



P(jg J*\e)> : 32 „ 2 exp 



1 + 



4AV 2 



1 f S/in 1 / 2 ^ 1 / 2 |fc - QjjcQj^jcqj, 



16C BE p 5/2 10C BE p 2 / ^2 \ / n ft2 



W /in f3AV 2 nV 2 -^^"ipr ' exp 



OG 



Proof. We follow the same approach as in the proof of Proposition 2.7 in Appendix D.7. We 
first assume that \\q — q*\\oo ^ ftMcx/2 and \\Q* — Q\\oo ^ ftM 2 /2 (on top of the assumptions 
made on Q and q). Following the same reasoning as in Appendix D.7, j is not included if 

\-q* + Q^CQ-cjc)- 1 ^ + ^V(Q*c )jc )- 1/2 7/i ,Q*(^)| < /i. 
In order to apply Berry-Esseen inequality given s, we first need to upper bound IQj^Qjcjc) -1 ^ 
QjJ c Qj^j c Qj c \: using Appendix B, by 
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Also, we need to bound, by Lemma C.4 and Appendix B: 



:A I 1 + — + -— — < V P2^ 

A A At / 



5 

A 2 " 



i + 4 



~X 2 fl 



Since /3i ^ jl, flifa ^ J^, and fa ^ A, we thus have: 



Q;,j(Q}.y) % ~ Qj.rQy-.y'h < /'• 

If we let denote A the event {| - q* + (),.,. (),-. [ f q), + li-Qj^Qlt^^ Q {q)o)\ ^ /i/2} and 5 the 
event - g*||oo piMa/2} n {||Q* - Q||oo faM 2 /2}, this implies that 

P(j £ J*\e) > ¥{A\e) - ¥{B c \e). (E.3) 

We have, by concentration inequalities from Appendix E.l: 

nfia 2 ' 



P(5 C |£) < 2pexp 



4||e| 



2p exp 



(E.4) 



Overall, if we assume the various bounds on q, q*, Q* and Q, to have j excluded from the 
active set for the bootstrap sample, it is sufficient that A is satisfied. As in Appendix D.7, 
we consider a smooth version of the indicator function, and we get that the probability of A, 
given e, is greater than 



(|M i -g iJc g i 4 c M ic -/ig jJ cg jc ^ 7 ^ Q (M i c)| ^ h/4)-r 



(E.5) 



where u is normal with mean q and covariance matrix a 2 Q/n, and, from Proposition 2.7, 
R ^ 16 f| 1 + -3^1/2 ^72 (note that we have used that Q is close to Q). 

We have that given Ujc, —uj + Qjj^QjcjcUjc is normal with mean — qj + Qj,jcQj}j C qjc and 
covariance matrix a 2 Qjj\jc/n. Thus, we get, using Lemma D.l: 



2 

-EP 
2 



Q ,.jQy \y " r ■ i<Qj. r QrJ" I.q ( ) < /V 4 



(E.6) 



n j + Qj,j c Qj^,j^ u j c + ^Qj,j c Q 'jc/jci^Q(ujc)\ < m/4|uj- 



M/4 



2<rn- 1 /2Q 1 /2 



21 + 



exp 



M /4 



2an- 1 /2Q 1 /2 



j,j c (g + . 



- 1/2 7j,Q(^ 



an 



-V 2 g 1/2 . c 



?j Qj,j c Qj c ,j cC ij c 



an 
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We have using our assumptions regarding q and Q: A min (Q)/2 ^ Qjj\j c ^ 2M 2 and \Qjjc{Qj° 
MA min (Q)~ 1/2 p 1/2 . Moreover, 



-1/2 rJ 



\Qjj c Qj c ,j c ij c Qj,j c Qj c ,j c ij c \ ^ 



p x l 2 fixM<j (p l ' 2 (3 2 M 2 AM 3 pp. 



Amin (Q) A min (Q) 3 / 2 



ig^ 2 - q-{£| < 4A min (g)- 3 / 2 ng - Q|| 2 < 



3p 3 / 2 M 3 /3 2 ftAf<r 

3/2 | 



2/5 2 M 2 

A min (g) 3 / 2 ' 



This leads to a lower bound of the form: 



// n 



1/2 



32 



1 + 



4AV2 



1 

2 



ijln 1/2 p 1/2 \qj - Q,,r Q r , r q, 



A 



an 



-v 2 g 1/2 . c 



(E.7) 



By combining Eq. (E.3), Eq. (E.4), Eq. (E.5), Eq. (E.6) and Eq. (E.7), we get the desired 
result. □ 

We can now consider the full bound using the analysis outlined in Section 3, using 
Lemma E.l, E.2 and E.3: 

P(J n ^ J) < mP((J*) c UJ^0) + ^ E(P(j G J*|e) m ), 
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^ 2j> mexp 
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We consider A = up i^ 3 / 10 and /3 2 = p l n 3 / 10 . We truncate lldloo at n 1//10 cr and — Trfrm 

an 1 ' 2 Q .'. .. 

at z and use Berry-Esseen inequality for g^c, to obtain: 

n l/2 
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All these constraints lead to the constraint tha np 6 should be larger than a function of c. 
We now need to optimize over z the following quantity: 



p i 



Ao(c) 



P 



If we select z such that §p + z = I 2 log 
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cxp 



— — + z 
2 V A 
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enough, i.e., if m ^ * •* (^i^r) 2 



1/2 



which is possible if m and n large 



and n ^ e*- * ^ (74 3 (c)p 3 ) 2 (^f7y-) 2 , then we have the bound: 



1,8c 
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exp 2 log 
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A 3 (c)p 3 n-V2 + JSK2 



This leads to the desired bound. 



F Proofs for bootstrapping residuals 

We use the following notation for the solution of the Lasso: w — w = Q~ l q + fia, where 
WQ&Woo ^ 1. We also denote U x = X(X T X)- 1 X T E R nxn the projection matrix on the 
data, which leads to the following expression for the non-centered estimated residuals: 

e = y — Xw = X(w — w) + e = (I — Ilx)e — jiXa. 

We let denote v = ~ Y^=i £»■ The boostrapped responses are thus y* = e> — v + w T Xi. The 
bootstrapped residuals are thus yi* + (w — w) T Xj, i.e.: 

e* = [Yl x e + + - v. 
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We have the following expectations: 



E(e fc .|e) = 
E(e*|e) = U x e + Xfia 



n 



-l T i = -1 T (I - U x )e - ^l T Xa = v, 
n n n 



var(etle) = var(e fc » |e) = — e T e — ti 2 = — e T (l — Tl x )e + u 2 a T Qa — v 1 

n n 



1 " 

E(q*\e) = - E(e fc « \e)x k = q + fiQa, 



k=l 



Q = vai(q*\e) = \ V] var(e fc . \e)x k xj = — 
k=i 



-e 1 (I - U x )e+/j, ce Q6t- ti- 



ll 



We let denote 7 = 2 -^-e T (l — U x )e + a 2 n 2 a T Qa — a 2 v 2 so that var(g*|e) = a 2 ^Qjn. 
F.l Concentration inequalities 

We need concentration inequalities for q* around q (given e) and of s around 1, as well as v 
around zero. 

Lemma F.l. Assume (Al-4) and t ^ 2 j^. We have: 



ti\ ^ to) ^ 2 exp 



-nt 2 X 
32p 2 f 2 



Proof. We have: ±1 T (I-Tl x )e = ± Y? i=1 ei[(I-U x )l]i with [(I-U x )l]i = 1-xjQ" 1 (^ELiO 
is such that 



I [(I - U x )l]i\ ^ 1 + \ mhl (Qr l M 2 p ^ 2\ min {QY l M 2 P = 5. 

A 



Thus, we get: 



P 
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toj<2«p y gr2MV j=2exp 



-nt 2 A 
8p 2 f 2 



We also have |^l T Xa| = |/xn 1 Efc=i ^Z^l ^ l^pM\ m i u {Q) 1 , hence the desired result with 
the extra condition on t. □ 



Lemma F.2. Assume (Al-4). We have: 



* — q — fiQ&Woo ^ tMa\e) ^ 2pexp 



-2nt 2 
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Proof. We have q* = q + fiQa + - Y^i=i x i^i* • Moreover, we have 

M 2 p 



^ We L ^ 1 



A 



Amin(Q) 

We get the result by Hoeffding's inequality. 

Lemma F.3. Assume (Al-4), | ^ | and p^- ^ £/4. IFe /iawe: 
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Proof. We first need to derive concentration for -e (I — using Eq. (A. 4). 



A max (i|I - n x |) ^ i + £ PxlU < i + ^ and p - n x ||| 
^\xj Q~ 1 Xj\ =Sj p/nX. We thus obtain from Eq. (A. 4): 



-, because 



We have: 
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Together with a T Qa ^ p/A m in(<3), we get the desired result. 



□ 



F.2 Proof of Proposition 3.4 

Following the analysis from Section 3, we need to upper bound P(j G J*|e) (probability of 
including a certain irrelevant variable into one of the replicated active sets), and P((J*) C U J ^ 
0) (probability of missing none of the relevant variables). We first prove two lemmas about 
each of them. 

Lemma F.4. Assume (Al-4) and p, ^ We have: 

nm(w) 2 A 2 \ ( n 1//2 

P((J ) UJ^ 0)< 2pexp ( o \J 2 ) +2nexp 



32pf 



— nA 2 m(w^ 2 
- /i -r ^ u/2 
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; / 



2 f2 p l/2 
/ 

+ 2p exp 

Proof. This lemma shows that all relevant variables will be selected with overwhelming prob- 
ability. From Lemma C.3, we have that J C J* as soon as \\w — wlU ^ v ^ M+ JnJ 2 ■ Thus, 

if m(w) > ^£jjy, ||g - q*h ^ m(w)A min (Q)/4 and ||g|| 2 ^ m(w)A min (Q)/4, then J C J*. 
Thus, we have (using results from Appendix F.l): 



P((J*) C UJ^ 0)< P ||g|| 2 ^ 



m(w)A min (Q) 



^ 2pexp 
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If we truncate ||e||oo at ero 1//4 p x / 4 , then we have the bound 
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exp 



n 1 ' 2 \ 
2f 2 p 1 / 2 ) 
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2p exp 
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hence the desired result. 

Lemma F.5. Assume (Al-4) and j e J c . VFe /iave: 



□ 
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Proof. We follow the same approach as in the proof of Proposition 2.7 in Appendix D.7 
and of Proposition 3.3 in Appendix E.2: j is not included if (note that Q = sQ and Q are 
proportional matrices) 

As before, we consider a smooth version of the indicator function, and we get that the prob- 
ability of not selecting j, given e, is greater than 



^P [luj-QjjcQ^Ujc-^QjjcQj^c^Qiujc)] n/A\-R 



(F.l) 



where u is normal with mean q and covariance matrix a 2 Q/n, and, from Proposition 2.7, 

jd < 16C 2 BE p 5 / 2 , 10C 2 BE p 2 

We have that given Ujo, —Uj + Qjj^QjcjcUjc is normal with mean — qj + Qj s joQj c - qjc and 
covariance matrix a 2 Qjj\jc/n. Thus, we get, using Lemma D.l: 
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By combining Eq. (F.l) and Eq. (F.2), we get the desired result. Note that if \s — 1| ^ 1/2, 
we have 



and 
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□ 



We can now consider the full bound using the analysis outlined in Section 3, using 
Lemma F.4, F.5, and Appendix F.l, together with truncating on the events ||£||oo ^ ero 1//4 
and |7 — 1| ^ n^ 1 / 3 (note that we can apply from Lemma F.3 for n large enough). First, we 
need a bound on 
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We have following the reasoning from Section 3: 
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p 3 16C 2 BE p 3 10C 2 BE p 2 
^4 2 (c)p- 1/2 ^ |s- and 5 < 2fin 1 ' 2 p 1 ' 2 . 

1 + 4AV2 

All these constraints lead to the constraint tha np~ 6 should be larger than a function of c. 
The rest of proof follows along the lines of the proof of Proposition 3.3 (note that the term 
e~ z ( 1- *)/ 2 instead of e~ z 7/2 only affects the constant A§). 

G Proofs of high-dimensional results 
G.l Proof of Proposition 4.1 

From Lemma C.2, we obtain optimality conditions for the solution of Eq. (2.1) to have the 
sign pattern t: 

IIQl^lQl.L^L — <?L C — /^Ql^lQl.L^L II 00 ^ A 1 ) 

sign(wj + (<5l,l9l - ^QZ i 1 L t L )j) = tj, 
signKQZi^L - nQl^t-L)*] = t K - 

It is thus sufficient for t to be the sign pattern that 

Vfc e L c , IQk^QZ^ ~ 9*1 < M (G- 1 ) 
VfceJ, |(QEi«L)*| < ^m(w), (G.2) 
VfcGK, \{QZlqM^liOQ^. (G.3) 

Eq. (G.l) occurs with probability greater than 1 — 2|L C | exp ^— "g f f|^j L j- Eq. (G.2) occurs 
with probability greater than 1 — 2|J| exp ^— n ™^ L ^ L j , and Eq. (G.3) occurs with probability 
greater than 1 — 2|K| exp ^~ ~ff^p V This leads to the desired result by the union bound. 

G.2 Proof of Proposition 4.2 

The bound is obtained simply from Proposition 4.1, Proposition Proposition 3.3 and Propo- 
sition Proposition 3.4, using the union bound. 
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